17 Random Variables and Distributions

Thus far, we have focused on probabilities of events. For example, we computed 
the probability that you win the Monty Hall game, or that you have a rare medical 
condition given that you tested positive. But in many cases we would like to know
more. For example, how many contestants must play the Monty Hall game until
one of them finally wins? How long will this condition last? How much will I lose 
gambling with strange dice all night? To answer such questions, we need to work 
with random variables. 



17.1 Definitions and Examples 

Definition 17.1.1. A random variable R on a probability space is a total function
whose domain is the sample space. 

The codomain of R can be anything, but will usually be a subset of the real 
numbers. Notice that the name "random variable" is a misnomer; random variables 
are actually functions!

For example, suppose we toss three independent,¹ unbiased coins. Let C be the
number of heads that appear. Let M = 1 if the three coins come up all heads or
all tails, and let M = 0 otherwise. Every outcome of the three coin flips uniquely
determines the values of C and M. For example, if we flip heads, tails, heads, then
C = 2 and M = 0. If we flip tails, tails, tails, then C = 0 and M = 1. In effect,
C counts the number of heads, and M indicates whether all the coins match.

Since each outcome uniquely determines C and M, we can regard them as func- 
tions mapping outcomes to numbers. For this experiment, the sample space is 

$$S = \{HHH, HHT, HTH, HTT, THH, THT, TTH, TTT\}$$

and C is a function that maps each outcome in the sample space to a number as
follows:

$$\begin{aligned}
C(HHH) &= 3 & C(THH) &= 2 \\
C(HHT) &= 2 & C(THT) &= 1 \\
C(HTH) &= 2 & C(TTH) &= 1 \\
C(HTT) &= 1 & C(TTT) &= 0.
\end{aligned}$$

¹ Going forward, when we talk about flipping independent coins, we will assume that they are
mutually independent.

Similarly, M is a function mapping each outcome another way: 

$$\begin{aligned}
M(HHH) &= 1 & M(THH) &= 0 \\
M(HHT) &= 0 & M(THT) &= 0 \\
M(HTH) &= 0 & M(TTH) &= 0 \\
M(HTT) &= 0 & M(TTT) &= 1.
\end{aligned}$$

So C and M are random variables. 
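The view of C and M as functions on the sample space can be sketched directly in Python. This is only an illustration: outcomes are represented as strings such as "HTH", and the name SAMPLE_SPACE is an assumption of this sketch, not notation from the text.

```python
from itertools import product

# Sample space: the 8 outcomes of three coin flips, written as strings like "HTH".
SAMPLE_SPACE = ["".join(flips) for flips in product("HT", repeat=3)]


def C(outcome: str) -> int:
    """Random variable C: the number of heads in the outcome."""
    return outcome.count("H")


def M(outcome: str) -> int:
    """Random variable M: 1 if all three coins match, 0 otherwise."""
    return 1 if outcome in ("HHH", "TTT") else 0


# Tabulate both random variables over the whole sample space.
for outcome in SAMPLE_SPACE:
    print(outcome, C(outcome), M(outcome))
```

Each function is total on the sample space, which is all that Definition 17.1.1 asks for.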

17.1.1 Indicator Random Variables 

An indicator random variable is a random variable that maps every outcome to
either 0 or 1. Indicator random variables are also called Bernoulli variables. The
random variable M is an example. If all three coins match, then M = 1; otherwise,
M = 0.

Indicator random variables are closely related to events. In particular, an in- 
dicator random variable partitions the sample space into those outcomes mapped 
to 1 and those outcomes mapped to 0. For example, the indicator M partitions the 
sample space into two blocks as follows: 

$$\underbrace{HHH \quad TTT}_{M = 1} \qquad \underbrace{HHT \quad HTH \quad HTT \quad THH \quad THT \quad TTH}_{M = 0}.$$

In the same way, an event E partitions the sample space into those outcomes in E
and those not in E. So E is naturally associated with an indicator random variable,
$I_E$, where $I_E(w) = 1$ for outcomes $w \in E$ and $I_E(w) = 0$ for outcomes $w \notin E$.
Thus, $M = I_E$ where E is the event that all three coins match.

17.1.2 Random Variables and Events 

There is a strong relationship between events and more general random variables 
as well. A random variable that takes on several values partitions the sample space 
into several blocks. For example, C partitions the sample space as follows: 

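$$\underbrace{TTT}_{C = 0} \qquad \underbrace{TTH \quad THT \quad HTT}_{C = 1} \qquad \underbrace{THH \quad HTH \quad HHT}_{C = 2} \qquad \underbrace{HHH}_{C = 3}.$$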
Each block is a subset of the sample space and is therefore an event. Thus, we
can regard an equation or inequality involving a random variable as an event. For
example, the event that C = 2 consists of the outcomes THH, HTH, and HHT.
The event $C \le 1$ consists of the outcomes TTT, TTH, THT, and HTT.

Naturally enough, we can talk about the probability of events defined by proper- 
ties of random variables. For example, 

$$\Pr[C = 2] = \Pr[THH] + \Pr[HTH] + \Pr[HHT] = \frac{1}{8} + \frac{1}{8} + \frac{1}{8} = \frac{3}{8}.$$

As another example:

$$\Pr[M = 1] = \Pr[TTT] + \Pr[HHH] = \frac{1}{8} + \frac{1}{8} = \frac{1}{4}.$$
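The two calculations above can be checked mechanically. The following is a minimal sketch, assuming the uniform probability of 1/8 per outcome; the helper names event and pr are hypothetical and chosen only for this illustration.

```python
from fractions import Fraction
from itertools import product

SAMPLE_SPACE = ["".join(flips) for flips in product("HT", repeat=3)]


def C(w):
    return w.count("H")


def M(w):
    return 1 if w in ("HHH", "TTT") else 0


def event(predicate):
    """The event (set of outcomes) on which the predicate holds."""
    return {w for w in SAMPLE_SPACE if predicate(w)}


def pr(ev):
    """Probability of an event when all 8 outcomes are equally likely."""
    return Fraction(len(ev), len(SAMPLE_SPACE))


print(pr(event(lambda w: C(w) == 2)))  # 3/8
print(pr(event(lambda w: M(w) == 1)))  # 1/4
```

Writing the condition as a predicate on outcomes mirrors the idea that "C = 2" is just a name for a set of outcomes.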

17.1.3 Functions of Random Variables 

Random variables can be combined to form other random variables. For example,
suppose that you roll two unbiased, independent 6-sided dice. Let $D_i$ be the
random variable denoting the outcome of the $i$th die for $i = 1, 2$. For example,

$$\Pr[D_1 = 3] = 1/6.$$

Then let $T = D_1 + D_2$. T is also a random variable and it denotes the sum of the
two dice. For example,

$$\Pr[T = 2] = 1/36$$

and

$$\Pr[T = 7] = 1/6.$$

Random variables can be combined in complicated ways, as we will see in Chapter 19.
For example,

$$Y = e^T$$

is also a random variable. In this case,

$$\Pr[Y = e^2] = 1/36$$

and

$$\Pr[Y = e^7] = 1/6.$$
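As an illustration, the distributions of T and of Y can be tabulated by brute force over the 36 equally likely rolls. The helper distribution and the tuple representation of outcomes are assumptions of this sketch.

```python
import math
from collections import Counter
from fractions import Fraction
from itertools import product

# The 36 equally likely outcomes (d1, d2) of rolling two fair dice.
OUTCOMES = list(product(range(1, 7), repeat=2))


def T(w):
    """Sum of the two dice."""
    return w[0] + w[1]


def Y(w):
    """A function of the random variable T, hence itself a random variable."""
    return math.exp(T(w))


def distribution(rv):
    """Map each value taken by rv to the probability of that value."""
    counts = Counter(rv(w) for w in OUTCOMES)
    return {value: Fraction(n, len(OUTCOMES)) for value, n in counts.items()}


dist_T = distribution(T)
dist_Y = distribution(Y)
print(dist_T[2], dist_T[7])                      # 1/36 1/6
print(dist_Y[math.exp(2)], dist_Y[math.exp(7)])  # 1/36 1/6
```

Because the exponential function is one-to-one, Y takes the value $e^t$ exactly when T takes the value t, so the two tables carry the same probabilities.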

17.1.4 Conditional Probability 

Mixing conditional probabilities and events involving random variables creates no
new difficulties. For example, $\Pr[C \ge 2 \mid M = 0]$ is the probability that at least
two coins are heads ($C \ge 2$) given that not all three coins are the same ($M = 0$).
We can compute this probability using the definition of conditional probability:

$$\begin{aligned}
\Pr[C \ge 2 \mid M = 0] &= \frac{\Pr[C \ge 2 \cap M = 0]}{\Pr[M = 0]} \\
&= \frac{\Pr[\{THH, HTH, HHT\}]}{\Pr[\{THH, HTH, HHT, HTT, THT, TTH\}]} \\
&= \frac{3/8}{6/8} \\
&= \frac{1}{2}.
\end{aligned}$$

The expression $C \ge 2 \cap M = 0$ on the first line may look odd; what is the set
operation $\cap$ doing between an inequality and an equality? But recall that, in this
context, $C \ge 2$ and $M = 0$ are events, and so they are sets of outcomes.
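Treating the two conditions as sets of outcomes makes the computation a one-liner. This sketch assumes equally likely outcomes; pr and pr_given are hypothetical helper names.

```python
from fractions import Fraction
from itertools import product

SAMPLE_SPACE = ["".join(flips) for flips in product("HT", repeat=3)]


def C(w):
    return w.count("H")


def M(w):
    return 1 if w in ("HHH", "TTT") else 0


def pr(ev):
    """Probability of an event (a set of outcomes), all outcomes equally likely."""
    return Fraction(len(ev), len(SAMPLE_SPACE))


def pr_given(a, b):
    """Pr[A | B] = Pr[A ∩ B] / Pr[B]; requires Pr[B] > 0."""
    return pr(a & b) / pr(b)


at_least_two_heads = {w for w in SAMPLE_SPACE if C(w) >= 2}  # the event C >= 2
not_all_match = {w for w in SAMPLE_SPACE if M(w) == 0}       # the event M = 0

print(pr_given(at_least_two_heads, not_all_match))  # 1/2
```

The set intersection `a & b` is exactly the ∩ that appears between the inequality and the equality.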

17.1.5 Independence 

The notion of independence carries over from events to random variables as well. 
Random variables $R_1$ and $R_2$ are independent iff for all $x_1$ in the codomain of $R_1$,
and $x_2$ in the codomain of $R_2$ for which $\Pr[R_2 = x_2] > 0$, we have:

$$\Pr[R_1 = x_1 \mid R_2 = x_2] = \Pr[R_1 = x_1].$$

As with events, we can formulate independence for random variables in an equivalent
and perhaps more useful way: random variables $R_1$ and $R_2$ are independent if
for all $x_1$ and $x_2$

$$\Pr[R_1 = x_1 \cap R_2 = x_2] = \Pr[R_1 = x_1] \cdot \Pr[R_2 = x_2].$$

For example, are C and M independent? Intuitively, the answer should be "no".
The number of heads, C, completely determines whether all three coins match; that
is, whether M = 1. But, to verify this intuition, we must find some $x_1, x_2 \in \mathbb{R}$
such that:

$$\Pr[C = x_1 \cap M = x_2] \ne \Pr[C = x_1] \cdot \Pr[M = x_2].$$
One appropriate choice of values is $x_1 = 2$ and $x_2 = 1$. In this case, we have:

$$\Pr[C = 2 \cap M = 1] = 0$$

and

$$\Pr[M = 1] \cdot \Pr[C = 2] = \frac{1}{4} \cdot \frac{3}{8} \ne 0.$$

The first probability is zero because we never have exactly two heads (C = 2)
when all three coins match (M = 1). The other two probabilities were computed
earlier.

On the other hand, let F be the indicator variable for the event that the first flip
is a Head, so

"F = 1" = {HHH, HTH, HHT, HTT}.

Then F is independent of M, since

$$\Pr[M = 1] = 1/4 = \Pr[M = 1 \mid F = 1] = \Pr[M = 1 \mid F = 0]$$

and

$$\Pr[M = 0] = 3/4 = \Pr[M = 0 \mid F = 1] = \Pr[M = 0 \mid F = 0].$$

This example is an instance of a simple lemma:

Lemma 17.1.2. Two events are independent iff their indicator variables are inde- 
pendent. 
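The product form of independence is easy to test exhaustively on a small sample space. The sketch below checks every pair of values taken by two random variables; the function name independent is hypothetical, and outcomes are assumed equally likely.

```python
from fractions import Fraction
from itertools import product

SAMPLE_SPACE = ["".join(flips) for flips in product("HT", repeat=3)]


def C(w):
    return w.count("H")


def M(w):
    return 1 if w in ("HHH", "TTT") else 0


def F(w):
    return 1 if w[0] == "H" else 0


def pr(predicate):
    """Probability that the predicate holds, all outcomes equally likely."""
    hits = sum(1 for w in SAMPLE_SPACE if predicate(w))
    return Fraction(hits, len(SAMPLE_SPACE))


def independent(X, Y):
    """Check Pr[X = x and Y = y] == Pr[X = x] * Pr[Y = y] for all values taken."""
    xs = {X(w) for w in SAMPLE_SPACE}
    ys = {Y(w) for w in SAMPLE_SPACE}
    return all(
        pr(lambda w: X(w) == x and Y(w) == y)
        == pr(lambda w: X(w) == x) * pr(lambda w: Y(w) == y)
        for x in xs
        for y in ys
    )


print(independent(C, M))  # False
print(independent(F, M))  # True
```

Values outside the range of either variable need no check, since both sides of the product equation are then zero.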

As with events, the notion of independence generalizes to more than two random
variables.

Definition 17.1.3. Random variables $R_1, R_2, \ldots, R_n$ are mutually independent iff

$$\Pr[R_1 = x_1 \cap R_2 = x_2 \cap \cdots \cap R_n = x_n] = \Pr[R_1 = x_1] \cdot \Pr[R_2 = x_2] \cdots \Pr[R_n = x_n]$$

for all $x_1, x_2, \ldots, x_n$.

A consequence of Definition 17.1.3 is that the probability that any subset of the
variables takes a particular set of values is equal to the product of the probabilities
that the individual variables take their values. Thus, for example, if $R_1, R_2, \ldots, R_{100}$
are mutually independent random variables, then it follows that:

$$\Pr[R_1 = 7 \cap R_7 = 9.1 \cap R_{23} = \pi] = \Pr[R_1 = 7] \cdot \Pr[R_7 = 9.1] \cdot \Pr[R_{23} = \pi].$$

The proof is based on summing over all possible values for all of the other random
variables.
17.2 Distribution Functions

A random variable maps outcomes to values. Often, random variables that show up 
for different spaces of outcomes wind up behaving in much the same way because 
they have the same probability of having any given value. Hence, random variables 
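on different probability spaces may wind up having the same probability density
function.

Definition. Let R be a random variable with codomain V. The probability
density function (pdf) of R is a function $\mathrm{PDF}_R : V \to [0, 1]$ defined by:

$$\mathrm{PDF}_R(x) ::= \begin{cases} \Pr[R = x] & \text{if } x \in \operatorname{range}(R) \\ 0 & \text{if } x \notin \operatorname{range}(R). \end{cases}$$

A consequence of this definition is that

$$\sum_{x \in \operatorname{range}(R)} \mathrm{PDF}_R(x) = 1.$$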

This is because R has a value for each outcome, so summing the probabilities over 
all outcomes is the same as summing over the probabilities of each value in the 
range of R. 

As an example, suppose that you roll two unbiased, independent, 6-sided dice. 
Let T be the random variable that equals the sum of the two rolls. This random 
variable takes on values in the set $V = \{2, 3, \ldots, 12\}$. A plot of the probability
density function for T is shown in Figure 17.1. The lump in the middle indicates
that sums close to 7 are the most likely. The total area of all the rectangles is 1
since the dice must take on exactly one of the sums in $V = \{2, 3, \ldots, 12\}$.

A concept closely related to the pdf is the cumulative distribution function (cdf)
for a random variable whose codomain is the real numbers. This is a function
$\mathrm{CDF}_R : \mathbb{R} \to [0, 1]$ defined by:

$$\mathrm{CDF}_R(x) ::= \Pr[R \le x].$$

"mcs-ftr" — 2010/9/8 — 0:40 — page 451 — #457

Figure 17.1 The probability density function for the sum of two 6-sided dice.
(Horizontal axis: x = 2, 3, ..., 12; vertical axis: $\mathrm{PDF}_T(x)$, with ticks at 3/36 and 6/36.)

As an example, the cumulative distribution function for the random variable T
is shown in Figure 17.2. The height of the ith bar in the cumulative distribution
function is equal to the sum of the heights of the leftmost i bars in the probability
density function. This follows from the definitions of pdf and cdf:

$$\begin{aligned}
\mathrm{CDF}_R(x) &= \Pr[R \le x] \\
&= \sum_{y \le x} \Pr[R = y] \\
&= \sum_{y \le x} \mathrm{PDF}_R(y).
\end{aligned}$$

In summary, $\mathrm{PDF}_R(x)$ measures the probability that $R = x$ and $\mathrm{CDF}_R(x)$
measures the probability that $R \le x$. Both $\mathrm{PDF}_R$ and $\mathrm{CDF}_R$ capture the same
information about the random variable R (you can derive one from the other), but
sometimes one is more convenient.
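To make the relationship concrete, here is a small sketch for the sum T of two dice. The dictionaries pdf, cdf and pdf_from_cdf are names chosen for this illustration; the cdf is tabulated only at the integer points 2, ..., 12.

```python
from fractions import Fraction
from itertools import accumulate, product

VALUES = list(range(2, 13))  # possible sums of two dice

# PDF_T(x): count the (d1, d2) pairs summing to x among 36 equally likely pairs.
pdf = {
    x: Fraction(sum(1 for d1, d2 in product(range(1, 7), repeat=2) if d1 + d2 == x), 36)
    for x in VALUES
}

# CDF_T(x): running sum of PDF_T(y) over all y <= x.
cdf = dict(zip(VALUES, accumulate(pdf[x] for x in VALUES)))

# Going the other way: successive differences of the cdf recover the pdf.
pdf_from_cdf = {x: cdf[x] - cdf.get(x - 1, Fraction(0)) for x in VALUES}

print(pdf[7], cdf[7], cdf[12])  # 1/6 7/12 1
```

The round trip from pdf to cdf and back is what "you can derive one from the other" means in practice.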

One of the really interesting things about density functions and distribution functions
is that many random variables turn out to have the same pdf and cdf. In other
words, even though R and S are different random variables on different probability
spaces, it is often the case that

$$\mathrm{PDF}_R = \mathrm{PDF}_S.$$

In fact, some pdfs are so common that they are given special names. For exam- 
ple, the three most important distributions in computer science are the Bernoulli 
distribution, the uniform distribution, and the binomial distribution. We look more 
closely at these common distributions in the next several sections. 



17.3 Bernoulli Distributions 



The Bernoulli distribution is the simplest and most common distribution function.
That's because it is the distribution function for an indicator random variable.
Specifically, the Bernoulli distribution has a probability density function of
the form $f_p : \{0, 1\} \to [0, 1]$ where

$$f_p(0) = p, \quad \text{and} \quad f_p(1) = 1 - p,$$

for some $p \in [0, 1]$. The corresponding cumulative distribution function is
$F_p : \mathbb{R} \to [0, 1]$ where:

$$F_p(x) ::= \begin{cases} 0 & \text{if } x < 0 \\ p & \text{if } 0 \le x < 1 \\ 1 & \text{if } 1 \le x. \end{cases}$$
17.4 Uniform Distributions



17.4.1 Definition 



A random variable that takes on each possible value with the same probability is
said to be uniform. If the sample space is $\{1, 2, \ldots, n\}$, then the uniform distribution
has a pdf of the form

$$f_n : \{1, 2, \ldots, n\} \to [0, 1]$$

where

$$f_n(k) = \frac{1}{n}$$

for some $n \in \mathbb{N}^+$. The cumulative distribution function is then $F_n : \mathbb{R} \to [0, 1]$
where

$$F_n(x) ::= \begin{cases} 0 & \text{if } x < 1 \\ k/n & \text{if } k \le x < k + 1 \text{ for } 1 \le k < n \\ 1 & \text{if } n \le x. \end{cases}$$

Uniform distributions arise frequently in practice. For example, the number rolled 
on a fair die is uniform on the set {1, 2, . . . , 6}. If p = 1/2, then an indicator 
random variable is uniform on the set {0, 1}. 
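The step shape of the uniform cdf is perhaps easiest to see in code. A minimal sketch, with hypothetical function names uniform_pdf and uniform_cdf, and with k assumed to be an integer:

```python
import math
from fractions import Fraction


def uniform_pdf(n, k):
    """f_n(k) = 1/n for integer k in {1, ..., n}, and 0 otherwise."""
    return Fraction(1, n) if 1 <= k <= n else Fraction(0)


def uniform_cdf(n, x):
    """F_n(x): 0 below 1, k/n on [k, k + 1) for 1 <= k < n, and 1 from n onward."""
    if x < 1:
        return Fraction(0)
    if x >= n:
        return Fraction(1)
    return Fraction(math.floor(x), n)


# A fair die is uniform on {1, ..., 6}.
print(uniform_cdf(6, 3.5))  # 1/2
print(uniform_cdf(6, 0.9))  # 0
print(uniform_cdf(6, 6))    # 1
```

Between integers the cdf is flat, because no probability mass sits there; it jumps by 1/n at each integer from 1 to n.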

17.4.2 The Numbers Game 

Enough definitions; let's play a game! I have two envelopes. Each contains an
integer in the range 0, 1, ..., 100, and the numbers are distinct. To win the game, you
must determine which envelope contains the larger number. To give you a fighting 
chance, we'll let you peek at the number in one envelope selected at random. Can 
you devise a strategy that gives you a better than 50% chance of winning? 

For example, you could just pick an envelope at random and guess that it contains 
the larger number. But this strategy wins only 50% of the time. Your challenge is 
to do better. 

So you might try to be more clever. Suppose you peek in one envelope and see
the number 12. Since 12 is a small number, you might guess that the number in
the other envelope is larger. But perhaps we've been tricky and put small numbers 
in both envelopes. Then your guess might not be so good! 

An important point here is that the numbers in the envelopes may not be random. 
We're picking the numbers and we're choosing them in a way that we think will 
defeat your guessing strategy. We'll only use randomization to choose the numbers 
if that serves our purpose, which is to make you lose! 
Intuition Behind the Winning Strategy

Amazingly, there is a strategy that wins more than 50% of the time, regardless of 
what numbers we put in the envelopes! 

Suppose that you somehow knew a number x that was in between the numbers 
in the envelopes. Now you peek in one envelope and see a number. If it is bigger 
than x, then you know you're peeking at the higher number. If it is smaller than x,
then you're peeking at the lower number. In other words, if you know a number x 
between the numbers in the envelopes, then you are certain to win the game. 

The only flaw with this brilliant strategy is that you do not know such an x. Oh 
well. 

But what if you try to guess x? There is some probability that you guess cor- 
rectly. In this case, you win 100% of the time. On the other hand, if you guess 
incorrectly, then you're no worse off than before; your chance of winning is still 
50%. Combining these two cases, your overall chance of winning is better than 
50%! 

Informal arguments about probability, like this one, often sound plausible, but
do not hold up under close scrutiny. In contrast, this argument sounds completely
implausible, but is actually correct!

Analysis of the Winning Strategy 

For generality, suppose that we can choose numbers from the set $\{0, 1, \ldots, n\}$. Call
the lower number L and the higher number H.

Your goal is to guess a number x between L and H. To avoid confusing equality
cases, you select x at random from among the half-integers:

$$\frac{1}{2},\ 1\frac{1}{2},\ 2\frac{1}{2},\ \ldots,\ n - \frac{1}{2}.$$

But what probability distribution should you use?

The uniform distribution turns out to be your best bet. An informal justification
is that if we figured out that you were unlikely to pick some number, say $50\frac{1}{2}$,
then we'd always put 50 and 51 in the envelopes. Then you'd be unlikely to pick
an x between L and H and would have less chance of winning.

After you've selected the number x, you peek into an envelope and see some
number T. If T > x, then you guess that you're looking at the larger number.
If T < x, then you guess that the other number is larger.

All that remains is to determine the probability that this strategy succeeds. We 
can do this with the usual four step method and a tree diagram. 
Step 1: Find the sample space.

You either choose x too low (< L), too high (> H), or just right (L < x < H).
Then you either peek at the lower number (T = L) or the higher number (T = H).
This gives a total of six possible outcomes, as shown in Figure 17.3.


choice of x (probability)     number peeked at (probability)   result   probability
x too low    (L/n)            T = L  (1/2)                     lose     L/2n
x too low    (L/n)            T = H  (1/2)                     win      L/2n
x just right ((H-L)/n)        T = L  (1/2)                     win      (H-L)/2n
x just right ((H-L)/n)        T = H  (1/2)                     win      (H-L)/2n
x too high   ((n-H)/n)        T = L  (1/2)                     win      (n-H)/2n
x too high   ((n-H)/n)        T = H  (1/2)                     lose     (n-H)/2n

Figure 17.3 The tree diagram for the numbers game.


Step 2: Define events of interest.

The four outcomes in the event that you win are marked in the tree diagram.

Step 3: Assign outcome probabilities.

First, we assign edge probabilities. Your guess x is too low with probability $L/n$,
too high with probability $(n - H)/n$, and just right with probability $(H - L)/n$.
Next, you peek at either the lower or higher number with equal probability.
Multiplying along root-to-leaf paths gives the outcome probabilities.
Step 4: Compute event probabilities.

The probability of the event that you win is the sum of the probabilities of the four
outcomes in that event:

$$\begin{aligned}
\Pr[\text{win}] &= \frac{L}{2n} + \frac{H - L}{2n} + \frac{H - L}{2n} + \frac{n - H}{2n} \\
&= \frac{1}{2} + \frac{H - L}{2n} \\
&\ge \frac{1}{2} + \frac{1}{2n}.
\end{aligned}$$

The final inequality relies on the fact that the higher number H is at least 1 greater
than the lower number L since they are required to be distinct.

Sure enough, you win with this strategy more than half the time, regardless of
the numbers in the envelopes! For example, if I choose numbers in the range
0, 1, ..., 100, then you win with probability at least $\frac{1}{2} + \frac{1}{200} = 50.5\%$. Even
better, if I'm allowed only numbers in the range 0, ..., 10, then your probability of
winning rises to 55%! By Las Vegas standards, those are great odds!
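The four-step analysis can be double-checked by enumerating every (threshold, peeked envelope) pair exactly. This is a sketch under the assumptions of the analysis: x is uniform over the n half-integers and each envelope is peeked at with probability 1/2. The function name win_probability is hypothetical.

```python
from fractions import Fraction


def win_probability(low, high, n):
    """Exact win probability of the threshold strategy for envelopes low < high in {0..n}."""
    if not 0 <= low < high <= n:
        raise ValueError("need 0 <= low < high <= n")
    thresholds = [Fraction(2 * i + 1, 2) for i in range(n)]  # 1/2, 3/2, ..., n - 1/2
    wins = Fraction(0)
    for x in thresholds:            # x is chosen uniformly: probability 1/n each
        for peeked in (low, high):  # each envelope is peeked at with probability 1/2
            guess_peeked_is_larger = peeked > x
            if guess_peeked_is_larger == (peeked == high):
                wins += Fraction(1, 2 * n)
    return wins


print(win_probability(50, 51, 100))  # 101/200
print(win_probability(3, 4, 10))     # 11/20
```

For every valid pair the result agrees with the closed form 1/2 + (H - L)/2n, and the worst case for the player is adjacent numbers.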

17.4.3 Randomized Algorithms 

The best strategy to win the numbers game is an example of a randomized algorithm:
it uses random numbers to influence decisions. Protocols and algorithms
that make use of random numbers are very important in computer science. There 
are many problems for which the best known solutions are based on a random num- 
ber generator. 

For example, the most commonly used protocol for deciding when to send a
broadcast on a shared bus or Ethernet is a randomized algorithm known as exponential
backoff. One of the sorting algorithms most commonly used in practice,
called quicksort, uses random numbers. You'll see many more examples if
you take an algorithms course. In each case, randomness is used to improve the 
probability that the algorithm runs quickly or otherwise performs well. 



17.5 Binomial Distributions 

17.5.1 Definitions 

The third commonly-used distribution in computer science is the binomial distribution.
The standard example of a random variable with a binomial distribution is
the number of heads that come up in n independent flips of a coin. If the coin is
fair, then the number of heads has an unbiased binomial distribution, specified by
the pdf

$$f_n : \{0, 1, \ldots, n\} \to [0, 1]$$

where

$$f_n(k) = \binom{n}{k} 2^{-n}$$

for some $n \in \mathbb{N}^+$. This is because there are $\binom{n}{k}$ sequences of n coin tosses with
exactly k heads, and each such sequence has probability $2^{-n}$.

Figure 17.4 The pdf for the unbiased binomial distribution for n = 20, $f_{20}(k)$.

A plot of $f_{20}(k)$ is shown in Figure 17.4. The most likely outcome is k = 10
heads, and the probability falls off rapidly for larger and smaller values of k. The 
falloff regions to the left and right of the main hump are called the tails of the 
distribution. We'll talk a lot more about these tails shortly. 

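The cumulative distribution function for the unbiased binomial distribution is
$F_n : \mathbb{R} \to [0, 1]$ where

$$F_n(x) ::= \begin{cases} 0 & \text{if } x < 0 \\ \sum_{i=0}^{k} \binom{n}{i} 2^{-n} & \text{if } k \le x < k + 1 \text{ for } 0 \le k < n \\ 1 & \text{if } n \le x. \end{cases}$$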
Figure 17.5 The pdf for the general binomial distribution $f_{n,p}(k)$ for n = 20
and p = .75. (Vertical axis: $f_{20,.75}(k)$, with ticks from 0.05 to 0.25.)



The General Binomial Distribution 

If the coins are biased so that each coin is heads with probability p, then the number
of heads has a general binomial density function specified by the pdf

$$f_{n,p} : \{0, 1, \ldots, n\} \to [0, 1]$$

where

$$f_{n,p}(k) = \binom{n}{k} p^k (1 - p)^{n-k}$$

for some $n \in \mathbb{N}^+$ and $p \in [0, 1]$. This is because there are $\binom{n}{k}$ sequences with
k heads and n − k tails, but now the probability of each such sequence is $p^k (1 - p)^{n-k}$.

For example, the plot in Figure 17.5 shows the probability density function
$f_{n,p}(k)$ corresponding to flipping n = 20 independent coins that are heads with
probability p = 0.75. The graph shows that we are most likely to get k = 15
heads, as you might expect. Once again, the probability falls off quickly for larger
and smaller values of k.
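The binomial pdf is simple to evaluate exactly for small n. The sketch below uses exact rational arithmetic; the function names binomial_pdf, binomial_cdf and most_likely are assumptions of this illustration, and math.comb requires Python 3.8 or later.

```python
from fractions import Fraction
from math import comb


def binomial_pdf(n, p, k):
    """f_{n,p}(k): probability of exactly k heads in n independent flips."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binomial_cdf(n, p, k):
    """F_{n,p}(k) for an integer k in 0..n: probability of at most k heads."""
    return sum(binomial_pdf(n, p, i) for i in range(k + 1))


def most_likely(n, p):
    """The number of heads with the largest probability."""
    return max(range(n + 1), key=lambda k: binomial_pdf(n, p, k))


print(most_likely(20, Fraction(1, 2)))  # 10
print(most_likely(20, Fraction(3, 4)))  # 15
```

Passing p as a Fraction keeps every probability exact, so the pdf values sum to exactly 1.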
The cumulative distribution function for the general binomial distribution is
$F_{n,p} : \mathbb{R} \to [0, 1]$ where

$$F_{n,p}(x) ::= \begin{cases} 0 & \text{if } x < 0 \\ \sum_{i=0}^{k} \binom{n}{i} p^i (1 - p)^{n-i} & \text{if } k \le x < k + 1 \text{ for } 0 \le k < n \\ 1 & \text{if } n \le x. \end{cases} \tag{17.1}$$



17.5.2 Approximating the Probability Density Function 

Computing the general binomial density function is daunting when k and n are
large. Fortunately, there is an approximate closed-form formula for this function
based on an approximation for the binomial coefficient. In the formula below, k is
replaced by $\alpha n$ where $\alpha$ is a number between 0 and 1.



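Lemma 17.5.1.

$$\binom{n}{\alpha n} \sim \frac{2^{nH(\alpha)}}{\sqrt{2\pi\alpha(1 - \alpha)n}} \tag{17.2}$$

and

$$\binom{n}{\alpha n} < \frac{2^{nH(\alpha)}}{\sqrt{2\pi\alpha(1 - \alpha)n}} \tag{17.3}$$

where $H(\alpha)$ is the entropy function²

$$H(\alpha) ::= \alpha \log\left(\frac{1}{\alpha}\right) + (1 - \alpha) \log\left(\frac{1}{1 - \alpha}\right).$$

Moreover, if $\alpha n > 10$ and $(1 - \alpha)n > 10$, then the left and right sides of Equation 17.2
differ by at most 2%. If $\alpha n > 100$ and $(1 - \alpha)n > 100$, then the difference
is at most 0.2%.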

The graph of H is shown in Figure 17.6. 

Lemma 17.5.1 provides an excellent approximation for binomial coefficients.
We'll skip its derivation, which consists of plugging in Theorem 9.6.1 for the fac- 
torials in the binomial coefficient and then simplifying. 

² $\log(x)$ means $\log_2(x)$.

Figure 17.6 The Entropy Function (vertical axis ticks from 0.2 to 0.8).

Now let's plug Equation 17.2 into the general binomial density function. The
probability of flipping $\alpha n$ heads in n tosses of a coin that comes up heads with
probability p is:



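$$\begin{aligned}
f_{n,p}(\alpha n) &\sim \frac{2^{nH(\alpha)}}{\sqrt{2\pi\alpha(1 - \alpha)n}}\, p^{\alpha n} (1 - p)^{(1 - \alpha)n} \\
&= \frac{2^{n\left(\alpha \log\left(\frac{p}{\alpha}\right) + (1 - \alpha) \log\left(\frac{1 - p}{1 - \alpha}\right)\right)}}{\sqrt{2\pi\alpha(1 - \alpha)n}}
\end{aligned} \tag{17.4}$$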



where the margin of error in the approximation is the same as in Lemma 17.5.1. 
From Equation 17.3, we also find that 



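$$f_{n,p}(\alpha n) < \frac{2^{n\left(\alpha \log\left(\frac{p}{\alpha}\right) + (1 - \alpha) \log\left(\frac{1 - p}{1 - \alpha}\right)\right)}}{\sqrt{2\pi\alpha(1 - \alpha)n}}. \tag{17.5}$$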



The formula in Equations 17.4 and 17.5 is as ugly as a bowling shoe, but it's
useful because it's easy to evaluate. For example, suppose we flip a coin that comes
up heads with probability p a total of n times. What is the probability of getting
exactly pn heads? Plugging $\alpha = p$ into Equation 17.4 gives:

$$f_{n,p}(pn) \sim \frac{1}{\sqrt{2\pi p(1 - p)n}}.$$
Thus, for example, if we flip a fair coin (where p = 1/2) n = 100 times, the
probability of getting exactly 50 heads is within 2% of 0.079, which is about 8%. 
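A quick numerical comparison shows how close the approximation is in this example. The sketch below evaluates the right-hand side of Equation 17.4 in floating point and compares it with the exact binomial probability; the function names are hypothetical.

```python
from math import comb, log2, pi, sqrt


def exact_binomial_pdf(n, p, k):
    """Exact f_{n,p}(k), evaluated in floating point."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def approx_binomial_pdf(n, p, k):
    """Right-hand side of Equation 17.4 with alpha = k/n (requires 0 < k < n)."""
    a = k / n
    exponent = n * (a * log2(p / a) + (1 - a) * log2((1 - p) / (1 - a)))
    return 2**exponent / sqrt(2 * pi * a * (1 - a) * n)


exact = exact_binomial_pdf(100, 0.5, 50)
approx = approx_binomial_pdf(100, 0.5, 50)
relative_error = abs(approx - exact) / exact
print(exact, approx, relative_error)
```

In this case the approximation slightly overestimates the exact value, as Equation 17.5 predicts, and the relative error is well inside the 2% margin stated in Lemma 17.5.1.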

17.5.3 Approximating the Cumulative Distribution Function 

In many fields, including computer science, probability analyses come down to getting
small bounds on the tails of the binomial distribution. In a typical application,
you want to bound the tails in order to show that there is very small probability that
too many bad things happen. For example, we might like to know that it is very 
unlikely that too many bits are corrupted in a message, or that too many servers or 
communication links become overloaded, or that a randomized algorithm runs for 
too long. 

So it is usually good news that the binomial distribution has small tails. To 
get a feel for their size, consider the probability of flipping at most 25 heads in
100 independent tosses of a fair coin. 

The probability of getting at most $\alpha n$ heads is given by the binomial cumulative
distribution function

$$F_{n,p}(\alpha n) = \sum_{i=0}^{\alpha n} \binom{n}{i} p^i (1 - p)^{n-i}. \tag{17.6}$$



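We can bound this sum by bounding the ratio of successive terms.
In particular, for $i \le \alpha n$,

$$\begin{aligned}
\frac{\binom{n}{i-1} p^{i-1} (1 - p)^{n-(i-1)}}{\binom{n}{i} p^i (1 - p)^{n-i}}
&= \frac{\dfrac{n!\, p^{i-1} (1 - p)^{n-i+1}}{(i-1)!\,(n-i+1)!}}{\dfrac{n!\, p^i (1 - p)^{n-i}}{i!\,(n-i)!}} \\
&= \frac{i(1 - p)}{(n - i + 1)p} \\
&\le \frac{\alpha n(1 - p)}{(n - \alpha n + 1)p} \\
&\le \frac{\alpha(1 - p)}{(1 - \alpha)p}.
\end{aligned}$$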
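This means that for $\alpha < p$,

$$\begin{aligned}
F_{n,p}(\alpha n) &< f_{n,p}(\alpha n) \sum_{i=0}^{\infty} \left[\frac{\alpha(1 - p)}{(1 - \alpha)p}\right]^i \\
&= \frac{f_{n,p}(\alpha n)}{1 - \frac{\alpha(1 - p)}{(1 - \alpha)p}} \\
&= \left(\frac{1 - \alpha}{1 - \alpha/p}\right) f_{n,p}(\alpha n).
\end{aligned} \tag{17.7}$$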

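In other words, the probability of at most $\alpha n$ heads is at most

$$\frac{1 - \alpha}{1 - \alpha/p}$$

times the probability of exactly $\alpha n$ heads. For our scenario, where p = 1/2 and
$\alpha = 1/4$,

$$\frac{1 - \alpha}{1 - \alpha/p} = \frac{3/4}{1/2} = \frac{3}{2}.$$

Plugging n = 100, $\alpha = 1/4$, and p = 1/2 into Equation 17.5, we find that the
probability of at most 25 heads in 100 coin flips is

$$F_{100,1/2}(25) < \frac{3}{2} \cdot \frac{2^{100\left(\frac{1}{4}\log(2) + \frac{3}{4}\log\left(\frac{2}{3}\right)\right)}}{\sqrt{75\pi/2}} \le 3 \cdot 10^{-7}.$$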

This says that flipping 25 or fewer heads is extremely unlikely, which is consis- 
tent with our earlier claim that the tails of the binomial distribution are very small. 
In fact, notice that the probability of flipping 25 or fewer heads is only 50% more 
than the probability of flipping exactly 25 heads. Thus, flipping exactly 25 heads is
twice as likely as flipping any number between 0 and 24!

Caveat. The upper bound on $F_{n,p}(\alpha n)$ in Equation 17.7 holds only if $\alpha < p$. If
this is not the case in your problem, then try thinking in complementary terms; that
is, look at the number of tails flipped instead of the number of heads. In fact, this
is precisely what we will do in the next example.
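The coin-flip example can be verified numerically by comparing the exact cdf with the bound obtained from Equations 17.5 and 17.7. This is a floating-point sketch; the function names are hypothetical, and the guard on alpha < p mirrors the caveat above.

```python
from math import comb, log2, pi, sqrt


def binomial_cdf(n, p, k):
    """Exact F_{n,p}(k): probability of at most k heads, in floating point."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def pdf_upper_bound(n, p, a):
    """Right-hand side of Equation 17.5 for alpha = a."""
    exponent = n * (a * log2(p / a) + (1 - a) * log2((1 - p) / (1 - a)))
    return 2**exponent / sqrt(2 * pi * a * (1 - a) * n)


def cdf_upper_bound(n, p, a):
    """Equation 17.7 combined with Equation 17.5; valid only when a < p."""
    if not a < p:
        raise ValueError("the bound requires alpha < p")
    return (1 - a) / (1 - a / p) * pdf_upper_bound(n, p, a)


exact = binomial_cdf(100, 0.5, 25)
bound = cdf_upper_bound(100, 0.5, 0.25)
print(exact, bound)
```

The exact tail probability comes out only slightly below the bound, so little is lost by using the closed form here.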

17.5.4 Noisy Channels 

Suppose you are sending packets of data across a communication channel and that 
each packet is lost with probability p = .01. Also suppose that packet losses are 
independent. You need to figure out how much redundancy (or error correction) to 
build into your communication protocol. Since redundancy is expensive overhead,
you would like to use as little as possible. On the other hand, you never want to be 
caught short. Would it be safe for you to assume that in any batch of 10,000 packets, 
only 200 (or 2%) are lost? Let's find out. 

The noisy channel is analogous to flipping n = 10,000 independent coins, each
with probability p = .01 of coming up heads, and asking for the probability that
there are at least $\alpha n$ heads where $\alpha = .02$. Since $\alpha > p$, we cannot use Equation 17.7.
So we need to recast the problem by looking at the numbers of tails. In
this case, the probability of tails is p = .99 and we are asking for the probability
of at most $\alpha n$ tails where $\alpha = .98$.

Now we can use Equations 17.5 and 17.7 to find that the probability of losing 2%
or more of the 10,000 packets is at most

$$\left(\frac{1 - .98}{1 - .98/.99}\right) \frac{2^{10000\left(.98\log\left(\frac{.99}{.98}\right) + .02\log\left(\frac{.01}{.02}\right)\right)}}{\sqrt{2\pi(.98)(1 - .98)10000}} < 2^{-60}.$$

This is good news. It says that planning on at most 2% packet loss in a batch of 
10,000 packets should be very safe, at least for the next few millennia. 
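Because the bound is astronomically small, it is convenient to evaluate its base-2 logarithm rather than the bound itself. The sketch below does this for the recast problem (successfully delivered packets play the role of tails); the function name log2_tail_bound is hypothetical.

```python
from math import log2, pi, sqrt


def log2_tail_bound(n, p, a):
    """log2 of the Equation 17.7 + 17.5 bound on Pr[at most a*n successes]."""
    if not a < p:
        raise ValueError("the bound requires alpha < p")
    exponent = n * (a * log2(p / a) + (1 - a) * log2((1 - p) / (1 - a)))
    factor = (1 - a) / (1 - a / p) / sqrt(2 * pi * a * (1 - a) * n)
    return exponent + log2(factor)


# Losing at least 2% of the packets means at most 98% arrive,
# where each packet arrives independently with probability .99.
bound_exponent = log2_tail_bound(10_000, 0.99, 0.98)
print(bound_exponent)  # a little below -60
```

Working with exponents avoids any concern about floating-point underflow when n is much larger.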

17.5.5 Estimation by Sampling 

Sampling is a very common technique for estimating the fraction of elements in 
a set that have a certain property. For example, suppose that you would like to 
know how many Americans plan to vote for the Republican candidate in the next 
presidential election. It is infeasible to ask every American how they intend to 
vote, so pollsters will typically contact n Americans selected at random and then 
compute the fraction of those Americans that will vote Republican. This value is 
then used as the estimate of the fraction of all Americans that will vote Republican.
For example, if 45% of the n contacted voters report that they will vote Republican,
the pollster reports that 45% of all Americans will vote Republican. In addition, 
the pollster will usually also provide some sort of qualifying statement such as 

"There is a 95% probability that the poll is accurate to within ±4 per- 
centage points." 

The qualifying statement is often the source of confusion and misinterpretation. 
For example, many people interpret the qualifying statement to mean that there is 
a 95% chance that between 41% and 49% of Americans intend to vote Republican. 
But this is wrong! The fraction of Americans that intend to vote Republican is a 
fixed (and unknown) value p that is not a random variable. Since p is not a random 
variable, we cannot say anything about the probability that $.41 \le p \le .49$.
To obtain a correct interpretation of the qualifying statement and the results of
the poll, it is helpful to introduce some notation.

Define $R_i$ to be the indicator random variable for the ith contacted American in
the sample. In particular, set $R_i = 1$ if the ith contacted American intends to vote
Republican and $R_i = 0$ otherwise. For the purposes of the analysis, we will assume
that the ith contacted American is selected uniformly at random (with replacement)
from the set of all Americans.³ We will also assume that every contacted person
responds honestly about whether or not they intend to vote Republican and that
there are only two options: each American intends to vote Republican or they
don't. Thus,

$$\Pr[R_i = 1] = p \tag{17.8}$$

where p is the (unknown) fraction of Americans that intend to vote Republican.
We next define

$$T = R_1 + R_2 + \cdots + R_n$$

to be the number of contacted Americans who intend to vote Republican. Then
T/n is a random variable that is the estimate of the fraction of Americans that
intend to vote Republican.

We are now ready to provide the correct interpretation of the qualifying statement.
The poll results mean that

$$\Pr[|T/n - p| \le .04] \ge .95. \tag{17.9}$$

In other words, there is a 95% chance that the sample group will produce an esti- 
mate that is within ±4 percentage points of the correct value for the overall popu- 
lation. So either we were "unlucky" in selecting the people to poll or the results of 
the poll will be correct to within ±4 points. 

How Many People Do We Need to Contact? 

There remains an important question: how many people n do we need to contact to 
make sure that Equation 17.9 is true? In general, we would like n to be as small as 
possible in order to minimize the cost of the poll. 

Surprisingly, the answer depends only on the desired accuracy and confidence of 
the poll and not on the number of items in the set being sampled. In this case, the 
desired accuracy is .04, the desired confidence is .95, and the set being sampled is 
the set of Americans. It's a good thing that n won't depend on the size of the set 
being sampled: there are over 300 million Americans!

³ This means that someone could be contacted multiple times.






The task of finding an n that satisfies Equation 17.9 is made tractable by observ- 
ing that T has a general binomial distribution with parameters n and p and then 
applying Equations 17.5 and 17.7. Let's see how this works. 

Since we will be using bounds on the tails of the binomial distribution, we first 
do the standard conversion 

$$\Pr[|T/n - p| \le .04] = 1 - \Pr[|T/n - p| > .04].$$

We then proceed to upper bound

$$\begin{aligned}
\Pr[|T/n - p| > .04] &= \Pr[T < (p - .04)n] + \Pr[T > (p + .04)n] \\
&= F_{n,p}((p - .04)n) + F_{n,1-p}((1 - p - .04)n).
\end{aligned} \tag{17.10}$$

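We don't know the true value of p, but it turns out that the expression on the
righthand side of Equation 17.10 is maximized when p = 1/2 and so

$$\begin{aligned}
\Pr[|T/n - p| > .04] &\le 2 F_{n,1/2}(.46n) \\
&< 2 \left(\frac{1 - .46}{1 - (.46/.5)}\right) f_{n,1/2}(.46n) \\
&< 13.5 \cdot \frac{2^{n\left(.46\log\left(\frac{.5}{.46}\right) + .54\log\left(\frac{.5}{.54}\right)\right)}}{\sqrt{2\pi \cdot 0.46 \cdot 0.54 \cdot n}} \\
&< \frac{10.81 \cdot 2^{-.00462n}}{\sqrt{n}}.
\end{aligned} \tag{17.11}$$

The second line comes from Equation 17.7 using $\alpha = .46$. The third line comes
from Equation 17.5.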

Equation 17.11 provides bounds on the confidence of the poll for different values
of n. For example, if n = 665, the bound in Equation 17.11 evaluates to .04978.
Hence, if the pollster contacts 665 Americans, the poll will be accurate to within
±4 percentage points with at least 95% probability.
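The bound in Equation 17.11 is easy to evaluate for any sample size. The sketch below computes it from the unrounded expression on the third line of the derivation, taking the worst case p = 1/2; the function name poll_error_bound and its margin parameter are hypothetical.

```python
from math import log2, pi, sqrt


def poll_error_bound(n, margin=0.04):
    """Upper bound on Pr[|T/n - p| > margin], using the worst case p = 1/2."""
    p = 0.5
    a = p - margin  # alpha = .46 for a 4-point margin
    exponent = n * (a * log2(p / a) + (1 - a) * log2((1 - p) / (1 - a)))
    pdf_bound = 2**exponent / sqrt(2 * pi * a * (1 - a) * n)  # Equation 17.5
    return 2 * (1 - a) / (1 - a / p) * pdf_bound              # Equation 17.7, both tails


print(poll_error_bound(665))   # just under 0.05
print(poll_error_bound(6650))  # below 1e-10
```

Since the exponent is linear in n, multiplying the sample size by ten shrinks the bound by many orders of magnitude, not by a factor of ten.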

Since the bound in Equation 17.11 is exponential in n, the confidence increases
greatly as n increases. For example, if n = 6,650 Americans are contacted, the poll
will be accurate to within ±4 points with probability at least $1 - 10^{-10}$. Of course,
most pollsters are not willing to pay the added cost of polling 10 times as many
people when they already have a confidence level of 95% from polling 665 people.